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Abstract.  We  present  a  new  scale-invariant  cost  function  for  non-orthogonal 
joint-diagonalization  of  a  set  of  symmetric  matrices  with  application  to 
Independent  Component  Analysis  (ICA).  We  derive  two  gradient  mini¬ 
mization  schemes  to  minimize  this  cost  function.  We  also  consider  their 
performance  in  the  context  of  an  ICA  algorithm  based  on  non-orthogonal 
joint  diagonalization. 

1  Introduction 

Simultaneous  or  Joint  Diagonalization  (JD)  of  a  set  of  estimated  statistics  matrices  is 
a  part  of  many  algorithms  especially  for  ICA  both  in  its  standard  and  non-stationary 
formulations.  Historically,  the  first  methods  developed  for  JD  were  those  that  assume 
the  joint  diagonalizer  to  belong  to  the  compact  Lie  group  of  orthogonal  matrices 
0(n)[5]  and  see  also  [1].  Accordingly  the  JD  problem  is  defined  as  a  minimization 
of  a  function  like: 

n 

Ji(0)  =  Y,  II OCt0T  -  diag((9Cj(9T)||^  (1) 

1=1 

where  {Ci}f=l  are  the  set  of  symmetric  matrices  to  be  diagonalized  and  O  G  0(n)  is 
the  joint  diagonalizer  sought.  We  remind  that  due  to  compactness  of  O(n)  we  know 
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in  advance  that  -J\  (0)  has  a  minimum  on  O(n).  Different  methods  for  minimization 
of  this  cost  function  in  the  context  of  Jacobi  methods  [5], [3]  and  optimization  on 
manifolds  have  been  proposed  among  them  [11], [13]. 

In  many  cases  it  is  believed  that  non-orthogonal  JD  is  more  efficient  in  the  context 
of  noisy  ICA.  In  fact  for  the  standard  ICA  model: 

Xnx  1  =  AnXnSnx  1  T  H  (2) 

with  n  a  Gaussian  noise  vector,  we  know  that  all  the  cumulant  slices  of  x  of  order 
higher  than  two  are  diagonalizable  by  an  un-mixing  matrix  in  congruence  manner. 
Here  by  un-mixing  matrix  (with  some  abuse  of  terminology)  we  mean  any  matrix 
B  such  that  BA  =  IJA  where  77  is  a  permutation  and  yl  is  a  non-singular  diagonal 
matrix.  If  is  a  subset  of  cumulant  matrix  slices  of  x  of  order  higher  than  two 

(which  are  symmetric  n  x  n  matrices)  and  B  is  an  un-mixing  matrix  belonging  to 
the  Lie  group  of  non-singular  matrices  GL(n)  then  BCiBT,s  all  are  diagonal.  The 
celebrated  JADE  algorithm  [5]  uses  this  fact  for  the  whitened  signal  for  which  the 
un-mixing  matrix  is  forced  or  assumed  to  be  orthogonal. 

Defining  a  suitable  cost  function  for  non-orthogonal  JD  seems  to  be  difficult  due 
to  non-compactness  of  GL(n).  For  a  suitable  cost  function  J(B)  we  expect  scale- 
invariance  that  is:  -J\  (AT?)  =  J\  (B)  for  any  non-singular  diagonal  matrix  A.  Note 
that  mutual  information  also  has  this  property:  for  any  random  vector  xnxi  and 
Axrexi  have  the  same  mutual  information.  In  [14]  and  [15]  a  cost  function  of  the 
form 

N 

MW,  {*}£,)  =  J2  HCi  -  wa,wt\\2f  (3) 

i— 1 

is  introduced  where  A*  are  diagonal  and  W  is  the  estimated  mixing  matrix.  Note 
that  the  minimization  would  be  over  W  and  A*,  therefore  n2  +  Nn  variables  are 
involved  whereas  the  original  problem  is  in  fact  a  search  in  n2  —  n  variables  (i.e. 
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finding  the  un-mixing  or  mixing  matrix  up  to  row  scaling).  This  cost  function 
encompasses  the  scale-invariability  by  introducing  diagonal  At.  In  fact  in  the  se¬ 
quel  we  will  follow  the  same  path,  but  we  will  choose  a  special  At.  Note  that  the 
single  term  ||C)  —  WMjkGr||]j  is  minimized  when  W  is  a  diagonalizer  of  C\  and 
At  =  diag ( W ~ 1  C'j W ~T)( see  (6)  below).  In  [1]  and  [2]  we  consider  using  the  same 
cost  function  as  J\{B)  with  B  G  GL(n)  for  non-orthogonal  JD.  The  problem  with 
J\{B)  is  that  it  is  not  scale- invariant  and  in  fact  can  be  reduced  as  ||5||  — >  0.  We 
showed  that  -J\  (B)  has  no  stationary  point  when  CV s  do  not  have  an  exact  joint 
diagonalizer.  Based  on  this  observation  we  provide  some  measures  to  deal  with  this 
issue  which  essentially  results  in  a  gradient  flow  of  the  form: 


B  =  - AB ,  5(0)  =  Inxn  (4) 

where  A  can  be  derived  from  H  =  ^  ( BCiBT  —  diag [BCiBT)) BCiBT^j  by 

either  equating  diagonal  of  H  to  zero  or  by  equating  A  =  H  —  tr(H)/n.  Note  that  in 
both  cases  tr(Z\)  =  0.  The  first  choice  is  equivalent  to  identifying  AB  and  B  for  non¬ 
singular  A  and  the  second  one  is  equivalent  to  identifying  aB  and  B  for  a  G  M  —  {0}. 
The  reader  is  referred  to  the  companion  paper  [1]  for  further  discussions. 

In  [16]  also  J\  (B)  is  used  and  a  heuristic  minimization  method  based  on  the  idea 
multiplicative  updates  is  developed  which  uses  the  idea  of  zero-diagonal  in  the  up¬ 
dates.  In  [12]  the  cost  function: 


J3(B)=Y^  log 


.  det  diag (BCiB7 
'  det  BC,Br 


is  introduced  where  G*  are  required  to  be  positive  definite.  Note  that  in  the  case 
of  high-order  cumulant  slices  this  is  not  the  case.  Note  that  J3(AB)  =  J3(B )  hence 
it  is  scale-invariant.  This  cost  function  is  derived  based  on  the  maximum  likelihood 
estimation  of  correlation  matrices  of  Gaussian  vectors  [12]. 
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In  the  light  of  (3)  and  (5)  and  the  discussions  above  we  introduce  this  cost  function: 

N 

MB)  =  Y  || B~1(BCiBT  -  dia,g(BCiBT))B~T\\2F  = 

1=1 

N 

Y  II Ci  -  B-1dmg(BClBT)B~T\\2F  (6) 

i=l 

Note  that  J±(AB)  =  MB)  and  there  is  no  condition  on  CV s  needed.  Needless  to 
say  that  this  cost  function  is  formed  merely  on  two  basis;  first,  scale-invariance  and, 
second  that  in  the  case  that  C£ s  have  an  exact  joint  diagonalizer  the  J4  can  become 
zero.  Therefore  its  applicability  and  further  justification  should  be  investigated. 
Note  that  the  first  formulation  for  J4  shows  somehow  a  ”  normalized”  version  of 
Ji{B)  and  the  second  formulation  relates  to  J2  with  a  specific  A,  and  using  the  un¬ 
mixing  matrix  instead  of  the  mixing  matrix.  One  immediate  problem  that  seems  to 
be  unavoidable  is  the  presence  of  B in  J±(B)  which  might  make  the  computations 
costly.  Note  that  J3  also  includes  terms  that  are  not  easy  to  compute  but  in  [12]  an 
approximation  is  used  to  avoid  this  direct  computation.  In  the  remainder  we  will 
derive  a  gradient  flow  for  minimization  of  J±(B).  In  section  (3)  we  also  derive  an 
adjoint  equation  for  minimization  of  J±(B)  with  respect  to  B _1,  based  on  this  we 
suggest  a  discrete  algorithm  for  minimization  of  J±(B)  that  does  not  require  explicit 
computation  of  B at  each  step.  In  section  (4)  we  will  use  the  derived  JD  algorithms 
in  an  ICA  algorithm  introduced  in  [2]  and  consider  some  numerical  results. 

Notation  We  already  used  the  notation  diag(A)  as  the  diagonal  part  of  A.  ||A||^ 
denotes  the  Frobenius  norm  of  matrix  A.  tr(A)  is  the  trace  of  the  matrix  A.  x 
shows  the  time  derivative  of  the  variable  x.  TPM  represents  the  tangent  space  to  the 
manifold  M  at  point  p.  InXn  is  the  identity  matrix  of  dimension  n  x  n.  All  random 
variables  are  in  boldface  small  letters. 
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2  Gradient  Flow  for  Minimization  of  J±(B) 


We  consider  GL(n)  as  a  Riemannian  manifold  with  the  Riemannian  metric  (also 
known  as  Natural  Riemannian  metric  [8]): 

(Z,V)B  =  tr  ((£B-1)T  riB-1)  =  tr  (TT^iT1)  =  tr(- rj{BTB)-'?)  (7) 


for  G  rj  e  TBGL(n).  Employing  the  relation  R-1  =  —  B  1BB  T  we  can  show  that 
the  gradient  flow  for  minimization  of  J\{B)  with  respect  to  this  Riemannian  metric 


is: 

B  =  - AB 
where: 


^  (^diag {BCiBT)  -dia ,g(Vi)BCiBT)  ) B,  B{ 0)  =  Inxn  (8) 

i=  1  ' 

Vi  =  B~T (Ci  -  B-1dia,g(BCiBT)B~T)B-1  = 


(BBt)~1  ( BCiBT  -  diag (BCiBT))  (BBT)~1  (9) 

It  is  interesting  to  note  that  the  term  A  in  (8)  which  in  fact  belongs  to  the  Lie 
algebra  of  GL(n),  i.e  gl(n)  is  such  that  diag(Z\)  =  0.  We  recall  that  in  order  to  use 
Jj  (B)  for  non-orthogonal  JD  we  forced  the  diagonal  of  the  corresponding  A  to  be 
zero  (See  (4)  and  [1]  or  [2])  whereas  using  J4(R)  as  a  cost  function  we  naturally 
reach  to  a  flow  that  has  this  property  and  in  fact  is  a  flow  on  the  group  of  n  x  n 
matrices  with  unity  determinant  i.e.  SL(n).(we  remind  that  the  Lie  algebra  of  SL(n) 
is  the  set  of  n  x  n  matrices  with  zero  trace) 

The  Euler  discretization  of  (8)  with  small  enough  step  size  results  in  the  steepest 
descent  algorithm.  This  can  demonstrated  as: 

Bk+ 1  =  (. I-iikAk)Bk  =  (i  -  /j,k^Vikdiag(BkCiBT)  -diag(Vik)BkCiBl')  Bk  (10) 

'  i=  1  ' 
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where  Vlk  =  (BkB%)  1(BkCiB J  -  diag {BkCiBl))(BkBl)  1  and  B0  =  Inxn.  The 
step  size  jik  should  be  such  that  at  each  update  the  cost  function  is  reduced  and 
det  (Bk)  is  almost  unity.  Note  that  for  a  steepest  descent  algorithm  on  a  linear  space 
there  is  no  restriction  on  the  step  size  as  long  as  it  is  such  that  the  cost  is  reduced 
but  in  this  case  which  is  in  fact  constrained  or  equivalently  is  such  that  the  answer  is 
confined  to  the  manifold  SL(n)  the  step  size  should  be  such  the  updates  stay  on  the 
manifold  to  a  good  extend.  In  [1]  and  [2]  we  proposed  a  discretization  scheme  based 
on  the  LU  decomposition  of  B  that  keeps  the  updates  on  SL(n)  by  construction 
which  can  be  applied  to  the  present  case  too.  Here,  however  we  choose  to  pick  a 
small  and  fixed  step  size  for  discretization.  Of  course  adaptive  step-size  methods 
with  an  eye  on  keeping  the  updates  on  SL(n)  will  result  in  faster  algorithms.  The 
pseudo  code  for  this  algorithm  is: 

Algorithm  1 

1.  Set  //  and  e. 

2.  Set  B0  =  Inxn  or  to  a  good  initial  guess. 

3.  While  ||Afc||F  >  e  do  Bk+l  =  (/  -  fiAk)Bk 

if  ||Hfc+1||^  is  ”  big'1  then  reduce  [i  and  goto  2. 

4.  End 

3  An  Inverse-Free  Algorithm 

In  implementing  the  discretized  expression  (10)  we  ought  to  compute  B^1  or  (BkB^)-1 
which  can  be  costly.  However,  one  should  notice  that  in  the  context  of  ICA  the  num¬ 
ber  of  matrices  C%  i.e.  N  is  quite  large  compared  to  n  and  in  fact  if  all  the  fourth 
order  cumulant  slices  are  used  N  =  n2.  Even  when  a  subset  of  cumulant  slices  is 
used  usually  N  S>  n.  Therefore  the  cost  of  computing  B^1  is  comparable  (in  order 
of  magnitude)  to  that  of  the  rest  of  (10).  In  the  case  N  =  n2  the  complexity  of  com- 
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pitting  (10)  is  of  order  0(n4)  whereas  the  complexity  of  computing  B^1  is  of  order 
0(n3).  Hence  we  may  conclude  that  in  the  context  of  JD  for  ICA,  the  computation 
of  inverse  is  not  a  significant  burden  compared  to  the  rest  of  computations  required 
in  (10).  Still  we  may  find  it  interesting  to  develop  a  scheme  for  minimization  of  J4 
which  is  free  of  computing  inverses.  To  this  end  here  we  propose  a  method  of  adjoint 
equations,  ft  is  possible  to  show  that  the  gradient  flow  for  minimization  of  J4(B) 
with  respect  to  B  1  is: 

B~l  =  —B~lA  (11) 

where  A  is  as  in  (8).  This  flow  is  derived  with  respect  to  a  Riemannian  metric 
defined  as: 

(£,  v)B  =  t  r((H“10r  =  ti(^T  B~T  B^rj)  =  tr  (rj(BBT)-1£T)  (12) 

The  reason  for  selecting  this  metric  is  that  considering  J4  as  a  function  of  B_1 
we  expect  J4(B_1yl)  =  J4(B_1),  so  at  the  point  B~l  a  tangent  vector  of  the  form 
B~XA  is  suitable.  Notice  that  this  flow  can  found  from  (8)  via  the  equation  ^-B-1  = 
-B  'BB  '. 

We  consider  the  Euler  discretization  of  (11): 

Bk+1  =  Bk  (/ +  /dkAk),  B0  =  I  (13) 

where  Ak  has  the  same  expression  as  in  (10).  The  idea  then  is  to  use  (10)  to  update 
B2fc- 1  and  (13)  to  update  B,^1 .  In  practice  this  leads  to  an  update  for  the  inverse 
that  is  not  exactly  equal  to  inverse  of  Bk  therefore  for  clarity  we  replace  B^1  with 
Qk  in  the  expressions.  It  was  noted  in  practice  that  the  second  formulation  for  in 
(9)  performs  better  than  the  Erst  one  therefore  in  the  algorithm  the  former  is  used. 
Here  is  a  pseudo  code  for  this  algorithm: 

Algorithm  2 

Consider  the  set  of  symmetric  matrices. 
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1.  Set  n  and  e. 

2.  Set  B_ i  Q—l  B o  Qo  Inxn- 

3.  do{ 

$i2fc+i  =  Q2kQ2k(B2k-\CiBjk_1  —  diag(S2fc_iC'jS^._i))Q2fcQ2fe 
^2k+i  =  Sili  yri2k+idi&E(B2k-iCiB%k_1)  —  diag(Wi2k+1)B2k_1CiBjk_1 

B2k+ 1  =  (f  —  At^2fc+i)-B2fc_i 

^i2k+2  =  QlkChk (B2k+\CtB^k+]  -  diag(B2k+iCiBjk+1))Q2kQ2k 
^2k+2  —  Eh  ^i2k+2diag(B2k+iCiB^k+1)  —  diag(^2fc+2)-B2/c+iC'ii?^.+1 
Q2k+2  =  Q2k{I  +  l^^2k+2J  } 

if  ||-B2fc+i||F  or  || ^2fc+2 1| f  is  ”  big”  then  reduce  /i  and  goto  2. 

While  ||Z\2fc+1||F  or  ||Z\2/£+2||f  >  e 

4.  End 

4  Simulations 

Here  we  consider  the  performance  of  the  developed  non-orthogonal  JD  methods  in 
the  context  of  an  ICA  algorithm  introduced  in  [2]  which  its  salient  feature  is  that 
although  it  whitens  the  data  it  does  not  confine  the  search  space  afterwards  to  O(n). 
Consider  the  data  model  (2)  A  simplified  version  of  that  algorithm  is: 

1.  Whiten  x,  let  W  be  the  whitening  matrix,  compute  y  =  ffx. 

2.  Estimate  C  =  {Ci)f=1  a  subset  of  the  fourth  order  cumulant  slice  matrices  of  y. 

3.  Jointly  diagonalize  C  =  {Ci}E  by  a  non-orthogonal  matrix  BJDN  (using  any 
algorithm  like  Algorithms  1  or  2)  and  set  B  =  -BjfatWA 

4.  Compute  x  =  Bx 

Example:  Consider 

x  =  Asnxi  +  cm  (14) 
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where  n  is  zero  mean  Gaussian  noise  with  identity  correlation  matrix  then  a2  indi¬ 
cates  the  power  of  noise.  We  consider  n  —  5  sources  all  of  them  uniformly  distributed 
in  [— |,  |],  The  matrix  A  is  randomly  generated  and  truncated  to  integer  entries: 

"8-36  -16  5 

12  6  11  2  2 
A  =  -15  8  -12  -10  -9 

-14  7  0  14  -21 
5  12-1-8  0 

We  generate  T  =  3500  samples  of  data  and  mix  the  data  through  A.  Next  we 
run  four  ICA  algorithms.  Three  algorithms  NH-JD  [2]  and  Algorithm  1  and  Al¬ 
gorithm  2  in  addition  to  the  standard  JADE  are  applied  to  the  data.  N  =  n 2 
fourth  order  cumulant  matrix  slices  are  used.  The  NH-JD  algorithm  is  basically 
the  above  ICA  algorithm  whose  JD  part  is  the  implementation  of  minimization  of 
Ji(B)  via  the  flow  (4)  with  the  nonholonomic  constraint  A  =  H  —  diag (H).  The 
discretization  of  the  JD  part  in  NH-JD  has  exactly  the  same  structure  as  Algo¬ 
rithm  1.  For  NH-JD,  Algorithm  1  and  Algorithm  2 ,  fi  =  .01  and  e  =  .01  are  used. 
These  values  are  not  optimal,  they  were  chosen  based  on  few  tries.  Implementations 
all  are  in  MAT  LAB®  code  and  the  Ad  AT  LAB®  code  for  JADE  was  downloaded 
from:  “http://tsi.enst.fr/cardoso/icacentral/Algos” .  The  performance  measure  used 
is  the  distance  of  the  product  of  the  estimated  un-mixing  and  the  mixing  matrix, 
i.e.  P  =  BA,  from  essential  diagonahty: 


Index(P)  =  £(£ 


“  max*.  |pifc| 


-!)+£(£ 


“  maxjfe  \pkj\ 


For  each  value  of  o  the  experiment  is  run  k  =  100  times  and  the  performance  mea¬ 
sure  is  averaged  over  the  trials.  Figure  (1)  shows  the  results.  We  can  see  that  the 
introduced  algorithms  all  have  almost  the  same  performance  and  out-perform  the 
standard  JADE  in  high  level  Gaussian  noise.  The  run-time  for  these  algorithms  (in 
MAT  LAB®  code)  is  much  higher  than  JADE’s,  although  we  expect  faster  perfor- 
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Fig.  1.  Average  in-noise-performance  index  (every  point  is  averaged  over  100  trials)  of  different  JD  based  ICA  algorithms. 
The  average  Index(P)  is  plotted  versus  a . 


mance  in  low-level  codes  or  DSPs.  Part  of  this  slower  convergence  can  be  attributed 
to  the  nature  of  gradient  based  methods  which  have  linear  convergence. 

5  Conclusion 

We  introduced  a  new  scale-invariant  cost  function  for  non-orthogonal  JD.  We  also 
derived  a  gradient  minimization  scheme  for  that  cost  function.  To  avoid  computing 
matrix  inverses  we  introduced  an  inverse-free  version  of  the  algorithm  developed. 
We  examined  the  performance  of  the  developed  JD  in  the  context  of  an  ICA  algo¬ 
rithm  introduced  in  [2]  and  compared  the  performance  with  JADE’s.  The  developed 
methods  out-perform  JADE  in  high  level  Gaussian  noise. 
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